
feat(ai): add 50-token overlap between embedding splits#35367

Merged
wezell merged 3 commits into main from issue-35366-overlap-embedding-tokens
Apr 23, 2026

Conversation

@wezell
Member

@wezell wezell commented Apr 17, 2026

Summary

  • Modifies EmbeddingsRunner to carry a ~50-token overlap from the end of each chunk into the start of the next
  • Replaces the single StringBuilder buffer with parallel List<String> sentences / List<Integer> tokenCounts lists to enable backward traversal for overlap computation
  • Adds ArrayList import

Motivation

Without overlap, context at chunk boundaries is lost — a sentence split across two chunks has neither half with full surrounding context. A 50-token trailing overlap ensures semantic continuity between consecutive embedding chunks.

Test plan

  • Verify embeddings are generated without errors for a contentlet with long text (>splitAtTokens)
  • Confirm adjacent chunks share ~50 tokens of trailing text from the previous chunk
  • Confirm no regression for short content (fewer tokens than splitAtTokens) — still saved as a single chunk

Closes #35366

🤖 Generated with Claude Code

This PR fixes: #35366

The chunks now overlap by roughly 15-20%.
The change is in EmbeddingsRunner.java:62-101. Here's what changed:

  • Before: on each split, the buffer was cleared entirely — the new chunk started with only the sentence that triggered the overflow.
  • After: on each split, the code walks backward through the sentence list to collect trailing sentences totaling ≥50 tokens, carries those into the next chunk as overlap, then resets totalTokens to the overlap size.
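
The walk-back described above can be sketched as follows. This is a hypothetical reconstruction, not the actual EmbeddingsRunner code: the method name `overlapStart`, the `tokenCounts` parameter, and the hard-coded 50-token constant are all assumptions.

```java
import java.util.List;

class OverlapSketch {
    static final int OVERLAP_TOKENS = 50; // assumed constant; the real value is configurable

    /** Returns the index of the first sentence to carry into the next chunk as overlap. */
    static int overlapStart(List<Integer> tokenCounts) {
        int start = tokenCounts.size();
        int carried = 0;
        // Walk backward until at least OVERLAP_TOKENS tokens have been collected
        // or we run out of sentences.
        while (start > 0 && carried < OVERLAP_TOKENS) {
            start--;
            carried += tokenCounts.get(start);
        }
        return start;
    }

    public static void main(String[] args) {
        List<Integer> counts = List.of(30, 10, 25, 20); // tokens per sentence
        // 20 + 25 = 45 < 50, so the walk also takes the 10-token sentence (55 >= 50).
        System.out.println(overlapStart(counts)); // prints 1
    }
}
```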

ref: #35366
@claude
Contributor

claude Bot commented Apr 17, 2026

Claude finished @wezell's task in 1m 9s


Code Review

  • Read diff and full file
  • Analyze for bugs, edge cases, design issues

Three issues worth addressing before merge:

1. Config inconsistency — overlapTokens bypasses per-host configuration

splitAtTokens is read via embeddingsAPI.config.getConfigInteger(AppKeys.EMBEDDINGS_SPLIT_AT_TOKENS) (per-app/host config), but overlapTokens uses Config.getIntProperty(...) — a global JVM-level property. This means the overlap can't be tuned per dotCMS site the way the split threshold can. It should either be added to AppKeys and read through the same embeddingsAPI.config mechanism, or at minimum documented as intentionally global.

2. Missing guard: overlapTokens >= splitAtTokens causes degenerate behavior

EmbeddingsRunner.java:94

while (overlapStart > 0 && overlapCount < overlapTokens) {

If someone sets overlapTokens close to or above splitAtTokens, the overlap carried into the next chunk will immediately trigger another split on the very first sentence added, producing near-empty or near-duplicate chunks in a tight loop. There's no validation that overlapTokens < splitAtTokens. A guard + log warning at line 65 would prevent misconfiguration.
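
A minimal guard along those lines might look like this. Everything here is an assumption standing in for the real code: the method name, the clamp fraction (splitAtTokens / 4 is an arbitrary choice), and plain stderr logging in place of the dotCMS Logger and config plumbing.

```java
class OverlapGuard {
    /** Returns a safe overlap value, clamping misconfigured inputs before the chunking loop runs. */
    static int sanitizedOverlap(int overlapTokens, int splitAtTokens) {
        if (overlapTokens >= splitAtTokens) {
            // In dotCMS this would be a Logger warning; stderr keeps the sketch self-contained.
            System.err.println("overlapTokens (" + overlapTokens
                    + ") >= splitAtTokens (" + splitAtTokens
                    + "); clamping to " + (splitAtTokens / 4));
            return splitAtTokens / 4;
        }
        return overlapTokens;
    }
}
```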

3. A single oversized sentence can silently blow past the overlap target

The while loop exits when overlapCount >= overlapTokens OR when overlapStart == 0. If the very first sentence walked back has 300 tokens, the overlap becomes 300 tokens — 6× the configured limit. For short-sentence prose this is fine, but for structured content (tables, code blocks treated as one "sentence" by BreakIterator) the overlap can approach splitAtTokens, making the next chunk nearly a full duplicate of the previous one. Worth at minimum a Logger.debug when overlapCount > overlapTokens * 2.


One subtlety that's correct but fragile:

hasUnsavedContent (EmbeddingsRunner.java:75, 86, 107) exists to prevent double-saving the overlap sentences after the final split. It's correct, but sentences serves dual purpose (accumulation buffer + overlap buffer), which a future maintainer could break by removing the flag after seeing !sentences.isEmpty() as sufficient. A comment explaining why the flag exists alongside sentences (i.e., "sentences may hold overlap-only content that was already saved") would prevent that regression.
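
The suggested comment could be paired with a sketch like the following, which shows why the flag (and not `!sentences.isEmpty()`) is the right final-chunk condition. All names here are hypothetical stand-ins for the real EmbeddingsRunner fields, and "saving" is modeled as appending to an in-memory list.

```java
import java.util.ArrayList;
import java.util.List;

class ChunkSaver {
    final List<String> sentences = new ArrayList<>();
    final List<String> savedChunks = new ArrayList<>();
    // sentences may hold overlap-only content that was already saved as part of
    // the previous chunk; this flag records whether anything NEW arrived since.
    boolean hasUnsavedContent = false;

    void add(String sentence) {
        sentences.add(sentence);
        hasUnsavedContent = true;
    }

    /** Saves the current chunk and keeps sentences[overlapFrom..] as the overlap. */
    void split(int overlapFrom) {
        savedChunks.add(String.join(" ", sentences));
        // Copy the overlap before clearing: subList is a view into sentences.
        List<String> overlap = new ArrayList<>(sentences.subList(overlapFrom, sentences.size()));
        sentences.clear();
        sentences.addAll(overlap);
        hasUnsavedContent = false; // the overlap content is already saved
    }

    void finish() {
        // Checking !sentences.isEmpty() here would re-save the overlap as a
        // standalone chunk when the stream ended exactly on a chunk boundary.
        if (hasUnsavedContent) {
            savedChunks.add(String.join(" ", sentences));
        }
    }
}
```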


No functional bugs found — the core logic (backward traversal for overlap, clearing lists after save, totalTokens = overlapCount for continuity) is sound. The issues above are defensive-correctness gaps, not correctness failures.

@github-actions github-actions Bot added the Area : Backend PR changes Java/Maven backend code label Apr 17, 2026
wezell and others added 2 commits April 22, 2026 10:38
When a content's sentence stream ended exactly on a chunk boundary, the
trailing block re-saved the overlap subset as a standalone chunk —
duplicate content that embeddingExists() couldn't detect (it compares
full normalized chunk text). Track unsaved content explicitly and only
emit the final chunk when new sentences have been added since the last
save.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@wezell wezell added this pull request to the merge queue Apr 22, 2026
Merged via the queue into main with commit 9ca800f Apr 23, 2026
49 checks passed
@wezell wezell deleted the issue-35366-overlap-embedding-tokens branch April 23, 2026 00:43

Labels

AI: Safe To Rollback Area : Backend PR changes Java/Maven backend code

Development

Successfully merging this pull request may close these issues.

Improve our embedding splitting

2 participants